Scrapy 3: crawling all pages

The last notebook (Scrapy 2) provided Scrapy code for scraping a single page from quotes.toscrape.com. Yet there are several other pages on this website that one may need to scrape, which means we have to create a Spider that performs the same scraping task for all the URLs, not just one. That can be implemented in several ways, but first of all, let's start a new project and generate a new spider.

To start a new project, open the command prompt (and move to the Data_Scraping folder first, if that is where you keep your projects), then run the following command:

scrapy startproject quote_pages
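
This creates a quote_pages folder with the standard Scrapy project layout, roughly the following (the exact contents may vary slightly between Scrapy versions):

quote_pages/
    scrapy.cfg
    quote_pages/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py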

So now move to the newly created folder and generate a new spider (called quote_all) for getting data from quotes.toscrape.com as follows:

cd quote_pages
scrapy genspider quote_all quotes.toscrape.com
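
The genspider command creates a quote_all.py file inside the spiders folder containing a bare template, roughly like this (the exact boilerplate depends on your Scrapy version):

import scrapy


class QuoteAllSpider(scrapy.Spider):
    name = 'quote_all'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass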

The spider we will create is basically the same one we had before (it scraped a single page and yielded a JSON file), just with some small changes. So let's copy the code from our spider and paste it inside the newly generated quote_all.py file.


In [1]:
# -*- coding: utf-8 -*-
import scrapy


class QuoteSpider(scrapy.Spider):
    name = "quote"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/page/1/',
                  'http://quotes.toscrape.com/page/2/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

As you can see, the very first (and rather brute-force) approach is to add the URLs one by one to the start_urls list. The good news is that all the URLs are quite similar: the only difference is the page number. This means we can construct each URL from three components, URL = 'http://quotes.toscrape.com/page/' + '1' + '/', where the 2nd component (here 1) is the only variable part. If you check manually, you will see that there are 10 pages overall that include quote data. This means we can create each link with the range() function and append it to an initially empty start_urls list as follows:

start_urls = []
for i in range(1, 11):
    URL = 'http://quotes.toscrape.com/page/' + str(i) + '/'
    start_urls.append(URL)
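
For the record, the same list could also be built in a single line with a list comprehension; a minimal equivalent sketch:

start_urls = ['http://quotes.toscrape.com/page/' + str(i) + '/' for i in range(1, 11)]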

Thus, the overall spider after the above change will look like this (note that we also change the name variable, as we do not want two spiders with the same name):


In [2]:
# -*- coding: utf-8 -*-
import scrapy


class QuoteSpider(scrapy.Spider):
    name = "quote_new"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = []
    # build the list of the 10 page URLs at class-definition time
    for i in range(1, 11):
        URL = 'http://quotes.toscrape.com/page/' + str(i) + '/'
        start_urls.append(URL)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

The same, of course, could be achieved using a while loop as follows:


In [3]:
# -*- coding: utf-8 -*-
import scrapy


class QuoteSpider(scrapy.Spider):
    name = "quote_new"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = []
    # same idea with a while loop: the counter i runs from 1 to 10
    i = 1
    while i < 11:
        URL = 'http://quotes.toscrape.com/page/' + str(i) + '/'
        start_urls.append(URL)
        i += 1

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

This approach is easy and user-friendly, yet it requires you to know the overall number of pages (10, in our case). A smarter solution is one that does not require this information. If you look closely, you will notice that there is a Next button on every page except one: the last page. The button contains a hyperlink to the next page, and since the last page has no next page, it has no Next button. This means we can navigate over the pages by following the hyperlink behind the Next button. It can be found with the following code, which uses CSS selectors to find a list item (li) with the class next, then find the <a> tag inside that list item and get the value of its href attribute:

next_page = response.css('li.next a::attr(href)').extract_first()
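
Before editing the spider, you can try this selector interactively in the Scrapy shell (a quick check, assuming the site is reachable from your machine):

scrapy shell "http://quotes.toscrape.com/page/1/"

Once the shell opens, running response.css('li.next a::attr(href)').extract_first() should return '/page/2/', confirming that the selector picks up the relative link behind the Next button.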

If we are on the very first page, the value of next_page will be /page/2/. The absolute link to the 2nd page will then be:

new_link = 'http://quotes.toscrape.com' + next_page
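
As a side note, Scrapy responses also provide a urljoin() helper that builds the absolute URL for you, so the domain does not need to be hard-coded; a minimal sketch of the same step:

# inside parse(): response.urljoin() resolves '/page/2/' against the page's own URL
next_page = response.css('li.next a::attr(href)').extract_first()
if next_page is not None:
    new_link = response.urljoin(next_page)  # e.g. 'http://quotes.toscrape.com/page/2/'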

To finalize the code, we first need to check whether there is a Next button (i.e. whether a next_page URL exists) and, if so, yield a new request to that URL as follows:

if next_page is not None:
    yield scrapy.Request(new_link)

The code above must be added inside the parse() function (but outside the for loop). Thus, the full code will look like this:


In [5]:
# -*- coding: utf-8 -*-
import scrapy


class QuoteSpider(scrapy.Spider):
    name = "quote_new"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]
    
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
        next_page = response.css('li.next a::attr(href)').extract_first()

        if next_page is not None:
            # build the absolute URL only when a Next button exists
            new_link = "http://quotes.toscrape.com" + next_page
            yield scrapy.Request(new_link)
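
For reference, more recent Scrapy versions (1.4 and later) also provide response.follow(), which accepts the relative href directly; the sketch below shows how the end of parse() could be shortened, assuming such a version is installed:

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            # response.follow() resolves the relative URL against the current page
            yield response.follow(next_page, callback=self.parse)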

Excellent! The code above should work, and to check it we can run a command that will generate a JSON file from the scraped data as follows:

scrapy crawl quote_new -o all_page_data.json
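
To sanity-check the result, you could load the generated file in Python and count the scraped items (a quick check, assuming the crawl finished and all_page_data.json sits in the project folder):

import json

# load the file produced by `scrapy crawl quote_new -o all_page_data.json`
with open('all_page_data.json', encoding='utf-8') as f:
    quotes = json.load(f)

print(len(quotes))  # 10 quotes per page over 10 pages, so 100 items are expected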